ABSTRACT

Group-based anonymization is the most widely studied approach
for privacy-preserving data publishing. It includes k-anonymity,
l-diversity, and t-closeness, to name a few. The goal of this
paper is to raise a fundamental issue regarding the privacy
exposure of the current group-based approach, one that has been
overlooked in the past. Group-based anonymization hides each
individual record behind a group to preserve data privacy. If
the data are not properly anonymized, patterns can be derived
from the published data and used by the adversary to breach
individual privacy. For example, if the released medical records
reveal a pattern such as people from certain countries rarely
suffering from some disease, then this information can be used
to link the other people in an anonymized group to this disease
with higher likelihood. We call the patterns derived from the
published data the foreground knowledge. This is in contrast to
the background knowledge that the adversary may obtain from
other channels, as studied in some previous work. Finally, we
show by experiments that the attack is realistic on the privacy
benchmark dataset under the traditional group-based
anonymization approach.
1. INTRODUCTION

A major technique used in privacy-preserving data publishing is
group-based anonymization, whereby records in the given relation
are partitioned into groups and each group must ensure some
property such as diversity so as to satisfy the privacy
requirement while maintaining sufficient data utility. There are
many privacy models associated with group-based anonymization,
such as k-anonymity [24], l-diversity [21], t-closeness [17],
(k, e)-anonymity [30], Injector [18] and m-confidentiality [27].
This technique appears sound for privacy-preserving data
publishing. However, when examined more carefully, all of these
models suffer from one fundamental privacy violation problem,
which has been overlooked in the past. The main cause of this
problem is that the utility maintained in the anonymized table
can help the adversary to breach individual privacy.

In the literature, background knowledge [21, 15, 22, 27, 
18] such as the rarity of a disease among a certain ethnic 
group or the pattern of age or gender for a disease can be 
used by the adversary to breach individual privacy. In this 
paper, we show that such knowledge can be mined from 
the published data or the anonymized data to compromise 
individual privacy. In fact, one of the main purposes of 
data publishing is data mining which is mainly about the 
discovery of patterns from the published data. 

Let us illustrate the problem with an example. Suppose 
a table T is to be anonymized for publication. Table T has 
two kinds of attributes, the quasi-identifier (QI) attributes 
and the sensitive attribute. The QI attributes can be used 
as an identifier in the table. [24] points out that in a real 
dataset, most individuals can be uniquely identified by three 
QI attributes, namely sex, date of birth and 5-digit zip code. 
The sensitive attribute contains some sensitive values. In
our example, Table 1 is the given table T where one of the
QI attributes is "Nationality" and the sensitive attribute is 
"Disease" containing sensitive values such as Heart Disease 
and HIV. Note that there can be other QI attributes in this 
table such as sex and zip code. For the sake of illustration, 
we list attribute "Nationality" only. Assume that each tuple 
in the table is owned by an individual and each individual 
owns at most one tuple. 

Suppose that we want to anonymize T and publish the 
anonymized dataset T* to satisfy some privacy require- 
ments. Typically, T* consists of a set of anonymized groups 
(in short, A-groups), where each A-group is a set of tuples 
with a multi-set of sensitive values that are linked with the 
A-group. Depending on the anonymization mechanism, each
A-group may correspond to either a set of quasi-identifier
(QI) values or a single generalized QI value. An attribute
GID is added for the ID of the A-group. Such an anonymized
dataset is generated as a result of group-based anonymization
commonly adopted in the literature of data publishing
[3, 16, 29, 27, 18, 17] (including k-anonymity, l-diversity,
t-closeness and a vast number of other privacy models).

[Table 2, panel (b) Sensitive table: A 2-diverse dataset anonymized from Ta...]

For illustration, a simplified setting of the l-diversity
model [21] is used as a privacy requirement for published
data T*. An A-group is said to be l-diverse or satisfy
l-diversity if in the A-group the number of occurrences of any
sensitive value is at most 1/l of the group size. A table
satisfies l-diversity (or it is l-diverse) if all A-groups in it
are l-diverse. Table 2 satisfies 2-diversity. The intention is
that each individual cannot be linked to a disease with a
probability of more than 0.5. However, does this table protect
individual privacy sufficiently?
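The simplified l-diversity requirement above can be sketched as a short check. This is a minimal illustration rather than code from the paper: the function names and the sample A-groups are hypothetical, and every A-group is assumed to be non-empty.

```python
from collections import Counter


def is_l_diverse(sensitive_values, l):
    """Return True if no sensitive value exceeds 1/l of the A-group size."""
    counts = Counter(sensitive_values)
    # count <= size / l, written with integers to avoid rounding issues
    return max(counts.values()) * l <= len(sensitive_values)


def table_is_l_diverse(groups, l):
    """A table is l-diverse when every one of its A-groups is l-diverse."""
    return all(is_l_diverse(values, l) for values in groups.values())


# Hypothetical sensitive-value multisets keyed by GID (not the paper's Table 2).
groups = {
    "L1": ["Heart Disease", "Flu"],
    "L2": ["Flu", "HIV"],
    "L3": ["Flu", "HIV", "Flu", "Cancer"],
}
print(table_is_l_diverse(groups, 2))
```

The check only looks at each group locally, which is exactly the limitation the rest of this section examines.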

Let us examine the A-group with GID equal to L1 as shown in
Table 2. We also refer to the A-group by L1. In L1, Heart
Disease and Flu are values of the sensitive attribute Disease.
It seems that each of the two individuals, Alex and Bob, in this
group has a 50% chance of being linked to Heart Disease (and
likewise to Flu). The chance is interpreted as 50% because the
analysis is based on this group locally, without any additional
information.

However, from the entire published table containing multiple
groups, the adversary may discover some interesting patterns
globally. For example, suppose the published table contains many
A-groups like L2, consisting entirely of Japanese with no
occurrence of Heart Disease. At the same time, there are many
A-groups like L3 containing some Japanese without Heart Disease.
The pattern that Japanese rarely suffer from Heart Disease can
then be uncovered. Note that it is very likely that such
anonymized data is published by conventional anonymization
methods, given that Heart Disease occurs rarely among Japanese.
With the pattern uncovered, the adversary can say that Bob,
being Japanese, has a lower chance of having Heart Disease, and
can deduce that Alex, being American, has a higher chance of
having Heart Disease. The intended 50% threshold is thus
violated.

1.1 Foreground Knowledge Attack 

The anonymized data can be seen as imprecise or uncertain
data [8, 9], and an adversary can uncover interesting patterns
since the published data must maintain high data utility
[29, 30, 27]. We call the uncovered patterns the foreground
knowledge (which is implicitly inside the table), in contrast to
the background knowledge studied by existing works
[21, 17, 30, 27], which the adversary must expend much effort to
obtain from somewhere outside the table. Since it is easy to
obtain the foreground knowledge from the anonymized dataset, all
existing works suffer from privacy breaches.

In Table 2, there are only two local possible worlds for
assigning the disease values to the two individuals in L1:
(1) w1: Alex is linked to Heart Disease and Bob is linked to
Flu, and (2) w2: Alex is linked to Flu and Bob is linked to
Heart Disease. To construct a probability distribution over the
domain of the real world, the simplest definition is based on
the assumption that all the possible worlds are equally likely,
that is, each world has the same probability. We refer to this
as the random world assumption.

If we publish a group L1 alone, the random world assumption
is a good principle in the absence of other information.
However, when several groups are published together, as is
typically the case, the groups with Japanese contribute to a
statement that their members are not likely to be linked to
Heart Disease. This statement means that the probability (or
weight) of the possible world w1 is much greater than that of
w2.
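To make the difference between equal and weighted possible worlds concrete, the following sketch weighs the two worlds of L1 using the illustrative probabilities from Table 3 in Section 2 (0.1 for an American and 0.003 for a Japanese being linked to Heart Disease). It treats Flu simply as "not Heart Disease" and assumes the two linkages are independent, as Section 3 does; the variable names are hypothetical.

```python
# Assumed per-nationality probabilities of Heart Disease (values of Table 3).
p_hd = {"American": 0.1, "Japanese": 0.003}

# w1: Alex (American) has Heart Disease, Bob (Japanese) does not.
weight_w1 = p_hd["American"] * (1 - p_hd["Japanese"])
# w2: Bob has Heart Disease, Alex does not.
weight_w2 = (1 - p_hd["American"]) * p_hd["Japanese"]

total = weight_w1 + weight_w2
p_alex_hd = weight_w1 / total  # probability that w1 is the real world
p_bob_hd = weight_w2 / total   # probability that w2 is the real world

print("random worlds:   Alex 0.500, Bob 0.500")
print(f"weighted worlds: Alex {p_alex_hd:.3f}, Bob {p_bob_hd:.3f}")
```

Under these assumed numbers almost all of the weight falls on w1, so Alex's linkage to Heart Disease is far above the intended 0.5 bound.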

Most previous privacy works such as l-diversity [21],
t-closeness [17], (k, e)-anonymity [30] and m-confidentiality
[27] adopt the random world assumption locally. In this
paper, the adversary's source of attack is the more complete
model of weighted possible worlds. We call this kind of attack
the foreground knowledge attack.

1.2 Contributions 

Our contributions can be summarized as follows. Firstly,
we define and study data anonymization issues in data
publication with the consideration of the foreground knowledge
attack, which has been ignored in the privacy literature.
Secondly, we show how an adversary can breach privacy by
computing the probability that an individual is linked to a
sensitive value by using foreground knowledge.

Finally, we have conducted experiments to show how the 
adversary can succeed in foreground knowledge attack for 
four recent privacy models, namely Anatomy [29], MASK 
[27], Injector [18] and t-closeness [17]. 

We emphasize that, similar to l-diversity, all privacy models
using group-based anonymization [29, 27, 18, 17] also
suffer from possible privacy breaches due to the utility of
the published table. We believe that this work is significant
in pointing out this overlooked issue, and that all follow-up
works should deter the foreground knowledge attack.

The rest of the paper is organized as follows. Section 2 
formulates the problem. Section 3 describes how the
adversary can breach individual privacy with the foreground
knowledge obtained from the anonymized data. Section 4 
shows how the adversary can obtain the foreground knowl- 
edge from the anonymized data. An empirical study is re- 
ported in Section 5. Section 6 reviews the related work. The 
paper is concluded in Section 7. 

2. PROBLEM DEFINITION 

Let T be a table. We assume that one of the attributes
is a sensitive attribute X where some values of this
attribute should not be linkable to any individual. The value
of the sensitive attribute of a tuple t is denoted by t.X.
A quasi-identifier (QI) is a set of attributes of T, namely
$A_1, A_2, \ldots, A_q$, that may serve as identifiers for some
individuals. Each tuple in the table T is related to one
individual and no two tuples are related to the same individual.

Let P be a partition of table T. We give a unique ID, called
GID, to this partition P and append an additional attribute
called GID to this partition, where each tuple in P has the same
GID value. Existing group-based anonymization defines a function
β on P to form an A-group such that the linkage between the QI
attributes and the sensitive attribute in the A-group is lost.
There are two ways in the literature to accomplish this. One is
generalization, which generalizes all QI values to the same
value. The other is bucketization, which forms two tables,
called the QI table and the sensitive table: P is projected on
all QI attributes and attribute GID to form the QI table, and on
the sensitive attribute and attribute GID to form the sensitive
table. A table T is anonymized to a dataset T* if T* is formed
by first partitioning T into a number of partitions, then
forming an A-group from each partition by β, and finally
inserting each A-group into T*. For example, Table 1 is
anonymized to Table 2 by bucketization.
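Bucketization as described above can be sketched in a few lines. This is an illustrative reading of the text, not the authors' implementation: the function name, the row layout and the sample rows are hypothetical, and the partitioning itself is taken as given.

```python
def bucketize(rows, partitions, qi_attrs, sensitive_attr):
    """Split each partition into QI-table rows and sensitive-table rows."""
    qi_table, sensitive_table = [], []
    for gid, members in partitions.items():
        for i in members:
            qi_row = {attr: rows[i][attr] for attr in qi_attrs}
            qi_row["GID"] = gid
            qi_table.append(qi_row)
        # Sorting detaches the sensitive values from the row order above.
        for value in sorted(rows[i][sensitive_attr] for i in members):
            sensitive_table.append({"GID": gid, sensitive_attr: value})
    return qi_table, sensitive_table


# Hypothetical raw rows; which person has which disease is made up here.
rows = [
    {"Nationality": "American", "Disease": "Heart Disease"},
    {"Nationality": "Japanese", "Disease": "Flu"},
]
qi_table, sensitive_table = bucketize(
    rows, {"L1": [0, 1]}, ["Nationality"], "Disease"
)
print(qi_table)
print(sensitive_table)
```

The two output tables share only the GID column, so within a group the link between a QI row and a particular sensitive value is no longer stated explicitly.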

In known voter registration lists, the QI values can often 
be used to identify a unique individual [24, 16]. We assume 
that there is a mapping which maps each tuple in T to an 
A-group in T*. For example, the first tuple $t_1$ in Table 1 is
mapped to A-group $L_1$.

In the following, for the sake of illustration, we focus on
the anonymized table generated by bucketization instead of
generalization. The discussion for generalization is the same as
that for bucketization. Specifically, generalization is similar
to bucketization, but generalization changes all QID values in a
partition to the same "generalized" values. If the table is
generated by generalization, each A-group contains the same
"generalized" values. In the worst-case scenario (which is a
basic assumption in the privacy literature [22, 27, 19]), the
adversary can uniquely map each individual in an A-group by an
external table such as a voter registration list. After the
mapping, each A-group contains individuals with the original QID
values, which becomes the case of bucketization. Thus, the
discussion for bucketization still applies in the case of
generalization. The worst-case scenario assumption is essential
in data publishing: no one can afford to have the privacy of an
individual breached [19]. AOL published a dataset of search logs
in 2006. After it realized that a single 62-year-old woman
living in Georgia could be re-identified from the search logs by
New York Times reporters, it withdrew the search logs and fired
two employees responsible for releasing them [5].

In the literature [29, 27, 18, 17], it is assumed that
the knowledge of the adversary includes (1) the published
dataset T*, (2) the QI value of a target individual, and (3) an
external table $T^e$, such as a voter registration list, that
maps QIs to individuals [24, 16]. We also follow these
assumptions in our analysis.

The aim of privacy preserving data publishing is to deter 
any attack from the adversary on linking an individual to 
a certain sensitive value. Specifically, the data publisher 
would try to limit the probability that such a linkage can be 
established. Let us consider an arbitrary sensitive value $x$
for the analysis. We denote any value in X which is not $x$
by $\bar{x}$.
              Heart Disease   Not Heart Disease
American      0.1             0.9
Japanese      0.003           0.997
French        0.05            0.95

Table 3: A global distribution of attribute "Nationality" for
our motivating example

In this paper, we consider that an adversary can obtain
additional information from the published dataset T* in the 
form of global distribution, which can lead to individual pri- 
vacy breach. In the example in Section 1, we can mine from 
the published table that the chance of Japanese suffering 
from Heart Disease is low compared with American. This 
pattern is from the global distribution for the attribute set 
{"Nationality"}. 

Consider an arbitrary sensitive value "Heart Disease".
Table 3 shows the global distribution of attribute set
{"Nationality"}, which consists of the probabilities that a
Japanese, an American or a French is linked to Heart Disease.
Each probability in the table is called a global probability.
The sample space for each such probability consists of the
possible assignments of the values $x$ and $\bar{x}$ to an
individual with the particular nationality.

Each possible value in attribute "Nationality" is called a 
signature. There are three possible signatures in our exam- 
ple: "Japanese", "American" and "French". In general, there 
are other attribute sets, such as {"Sex", "Nationality"}, with 
their corresponding global distributions. We define the
signature and the global distribution for a particular attribute
set A as follows. 

Definition 1 (Signature). Let T* be the published
dataset. Given a QI attribute set A with r attributes
$A_1, \ldots, A_r$, a signature s of A is a set of attribute-value
pairs $(A_1, v_1), \ldots, (A_r, v_r)$ which appear in the published
dataset T*, where $A_i$ is a QI attribute and $v_i$ is a value. A
tuple t in T* is said to match s if $t.A_i = v_i$ for all
$i = 1, 2, \ldots, r$.

For example, a signature s can be {("Nationality",
"American"), ("Sex", "Male")} if the attribute set A is
{"Nationality", "Sex"}. For convenience, we often drop the
attribute names in a signature, and thus we refer to
{"American", "Male"} instead of {("Nationality", "American"),
("Sex", "Male")}. The first tuple $t_1$ in Table 2(a) matches
{"American"} but the second tuple does not.
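Signature matching as in Definition 1 amounts to comparing attribute-value pairs. A minimal sketch follows, assuming tuples are represented as dictionaries; the function name and the sample rows, including the "Sex" values, are hypothetical.

```python
def matches(row, signature):
    """True if the row agrees with every attribute-value pair of the signature."""
    return all(row.get(attr) == value for attr, value in signature.items())


signature = {"Nationality": "American", "Sex": "Male"}
# Hypothetical QI-table rows; the "Sex" values are made up for illustration.
qi_rows = [
    {"GID": "L1", "Nationality": "American", "Sex": "Male"},
    {"GID": "L1", "Nationality": "Japanese", "Sex": "Male"},
]
matched = [matches(row, signature) for row in qi_rows]
print(matched)
```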

Definition 2 (Global Distribution). Given an attribute
set A, the global distribution G of A contains a set
of entries $(s : x, p)$ for each possible signature s of A, where
p is equal to $p(s : x)$, which denotes the probability that a
tuple matching signature s is linked to x given the published
dataset T*.

For example, if G contains ("Japanese":"Heart Disease", 
0.003) and ("American":"Heart Disease", 0.1), then the prob- 
ability that a Japanese patient is linked to Heart Disease is 
equal to 0.003 while that of an American patient is 0.1. 

The global distribution G derived from the published 
dataset T* is called the foreground knowledge. We will de- 
scribe how the adversary derives G from the published table. 

Problem 1 (Foreground Knowledge). Given any 
arbitrary attribute set A, we want to find the global distri- 
bution G of A from published dataset T* . 



In Section 1, we showed that with the global distribution
G of attribute set {"Nationality"}, we can deduce that the
chance of Alex, an American, suffering from Heart Disease
is high. Let t be Alex and x be Heart Disease. The chance
can be formulated by $p(t : x)$, the probability that t is linked
to x given G.



Problem 2 (Privacy Breach). Given a 
dataset T* , for any individual t, any sensitive value x and 
any attribute set A, we want to determine whether the prob- 
ability that t is linked to x denoted by p(t : x) is greater than 
1/r. Individual t is said to suffer from privacy breaches if 
the probability is greater than 1/r. 

In this paper, we study Problems 1 and 2. In Section 3, we
first show how to solve Problem 2, assuming that we are given
the foreground knowledge. Then, in Section 4, we describe how to
mine the foreground knowledge from the published dataset T* for
Problem 1. We shall show that the two problems are intertwined,
since the global probability is derived from the published
table, and thus depends on the probability $p(t : x)$ for each
tuple t.



Symbol                                        Meaning
$L_k$                                         an A-group (anonymized group) in the anonymized dataset
$A$                                           set of attributes, e.g. {"Nationality", "Sex"}
$t_1, \ldots, t_N$                            tuples in an A-group
$s_1, \ldots, s_m$                            signatures for A, e.g. {"American", "Male"}; multiple tuples $t_j$ can map to the same $s_i$
$x$                                           a sensitive value
$\bar{x}$                                     any value not equal to $x$
$p(t_j : x)$                                  probability that tuple $t_j$ is linked to value $x$
$p(s_i : x)$                                  probability that signature $s_i$ is linked to $x$
$f_i$                                         a simplified notation for $p(s_i : x)$
$\bar{f}_i$                                   a simplified notation for $p(s_i : \bar{x})$
$w$                                           a possible world: an assignment of the tuples in A-group $L_k$ to the sensitive values $x$ and $\bar{x}$
$\mathcal{W}_k$                               set of all possible worlds $w$ for $L_k$
$\{w \in \mathcal{W}_k : (t_j : x) \in w\}$   set of all possible worlds $w$ in $\mathcal{W}_k$ in which $t_j$ is assigned value $x$
$p(w)$                                        probability that $w$ occurs given the anonymized dataset and based on A
$p(w \mid L_k)$                               conditional probability that $w$ occurs given A-group $L_k$
$p_{j,w}$                                     let $t_j$ be linked to $\gamma$ in $w$, where $\gamma$ is $x$ or $\bar{x}$; $p_{j,w}$ is the probability that $t_j$ is linked to $\gamma$
$\mathcal{L}_{s_i}$                           set of A-groups containing tuples matching $s_i$
$L_k(s_i)$                                    the set of tuples in $L_k$ matching $s_i$
$c_k(s_i : x)$                                the expected number of tuples which match $s_i$ and are linked to $x$ in the A-group $L_k$

Table 6: Notations

3. FINDING PRIVACY BREACHES 

We assume that the attack is based on the linkage of an
attribute set A to a sensitive value x. We denote by $\bar{x}$ any
value not equal to x. In this section, we assume that the
global distribution G for A and x has been determined,
and we show how an adversary can use G to find privacy
breaches. How the global distribution can be derived is
explained in Section 4.

Suppose there are m possible signatures for attribute set
A, namely $s_1, s_2, \ldots, s_m$. The global distribution G of A
is shown in Table 4. To simplify our presentation, the
probability that $s_i$ is linked to $x$ ($\bar{x}$), $p(s_i : x)$
($p(s_i : \bar{x})$), is denoted by $f_i$ ($\bar{f}_i$).



Given G, the formula for $p(t : x)$, the probability that a
tuple t is linked to sensitive value x, is derived here. Suppose
t belongs to A-group $L_k$. For ease of reference, we
summarize the notations that we use in Table 6. We shall
need the following definitions.

Definition 3 (primitive events, projected events).
A mapping $t : \gamma$ from an individual or tuple t to a sensitive
value $\gamma$ ($x$ or $\bar{x}$) is called a primitive event. Suppose
t matches signature s. We call the event for the
corresponding signature, "$s : \gamma$", a projected event for t.

Hence, a primitive event is an event in the sample space
for $p(t : x)$, which is the probability of interest to the
adversary. A projected event is an event for $p(s : x)$, which
appears in the global distribution G.

Definition 4 (possible world). Consider an A-group
$L_k$ with N tuples, namely $t_1, t_2, \ldots, t_N$, with
sensitive values $\gamma_1, \gamma_2, \ldots, \gamma_N$, where $\gamma_i$
is either $x$ or $\bar{x}$ for $i = 1, 2, \ldots, N$. A possible world
w for $L_k$ is a possible assignment mapping the tuples in set
$\{t_1, t_2, \ldots, t_N\}$ to values in multi-set
$\{\gamma_1, \gamma_2, \ldots, \gamma_N\}$ in $L_k$.

Consider an A-group $L_k$ with a set of tuples and a multi-set
of sensitive values. For each possible world w, according
to the global distribution G based on attribute set A, we
compute the probability $p(w)$ that w occurs. The sample
space for $p(w)$ consists of all the possible assignments of $x$
or $\bar{x}$ to a set of N tuples with the same signatures as those
in $L_k$.

Suppose that in a possible world w for $L_k$, tuple $t_j$ is
linked to $\gamma$, where $\gamma$ is either $x$ or $\bar{x}$. Let
$p_{j,w}$ be the probability that $t_j$ is linked to $\gamma$.

Like [21, 29, 27], we assume that the linkage of a sensitive 
value to an individual is independent of the linkage of a 
sensitive value to another individual. For example, whether 
an American suffers from Heart Disease is independent of 
whether a Japanese suffers from Heart Disease. Thus, for a 
possible world w for $L_k$, the probability that w occurs is the
product of the probabilities of the corresponding projected
events for the tuples $t_1, \ldots, t_N$ in $L_k$.



$$p(w) = p_{1,w} \times p_{2,w} \times \cdots \times p_{N,w} \tag{1}$$



Suppose $t_j$ matches signature $s_i$. If $t_j$ is linked to $x$ in w,
then $p_{j,w} = f_i$. Otherwise, $p_{j,w} = \bar{f}_i$.

$p(w)$ corresponds to the weight of w, which we mentioned
in the introduction.

The probability of $L_k$ given T* is the sum of the
probabilities of all the possible worlds consistent with T* for
$L_k$. Let the set of these worlds be $\mathcal{W}_k$. For
$w \in \mathcal{W}_k$, we have

$$p(w \mid L_k) = \frac{p(w)}{\sum_{w' \in \mathcal{W}_k} p(w')} \tag{2}$$

It is easy to verify that $\sum_{w \in \mathcal{W}_k} p(w \mid L_k) = 1$.

Our objective is to find the probability that an individual
$t_j$ in $L_k$ is linked to a sensitive value x. This is given by
the sum of the conditional probabilities $p(w \mid L_k)$ of all the
possible worlds w where $t_j$ is linked to x.

$$p(t_j : x) = \sum_{w \in \mathcal{W}_k,\; (t_j : x) \in w} p(w \mid L_k) \tag{3}$$

where the sum ranges over the set of all possible worlds w in
$\mathcal{W}_k$ in which $t_j$ is assigned value x.

One can verify that $p(t_j : x) + p(t_j : \bar{x}) = 1$.
Table 4: Global distribution

         $p(s : x)$   $p(s : \bar{x})$
$s_1$    0.5          0.5
$s_2$    0.2          0.8

(a) Global distribution

w       $t_1$       $t_2$       $t_3$       $t_4$       $p(w)$                          $p(w \mid L_k)$
$w_1$   $x$         $x$         $\bar{x}$   $\bar{x}$   0.5 × 0.5 × 0.8 × 0.8 = 0.16    0.16/0.33 = 0.48
$w_2$   $x$         $\bar{x}$   $x$         $\bar{x}$   0.5 × 0.5 × 0.2 × 0.8 = 0.04    0.04/0.33 = 0.12
$w_3$   $x$         $\bar{x}$   $\bar{x}$   $x$         0.5 × 0.5 × 0.8 × 0.2 = 0.04    0.04/0.33 = 0.12
$w_4$   $\bar{x}$   $x$         $x$         $\bar{x}$   0.5 × 0.5 × 0.2 × 0.8 = 0.04    0.04/0.33 = 0.12
$w_5$   $\bar{x}$   $x$         $\bar{x}$   $x$         0.5 × 0.5 × 0.8 × 0.2 = 0.04    0.04/0.33 = 0.12
$w_6$   $\bar{x}$   $\bar{x}$   $x$         $x$         0.5 × 0.5 × 0.2 × 0.2 = 0.01    0.01/0.33 = 0.03

(b) $p(w)$ and $p(w \mid L_k)$

Table 5: An example illustrating the computation of $p(t_j : x)$



Example 1. Consider an A-group $L_k$ in a published table
T*. Suppose there are four tuples, $t_1, t_2, t_3$ and $t_4$, and
four sensitive values, $x$, $x$, $\bar{x}$ and $\bar{x}$, in $L_k$.
Suppose the published table T* satisfies 2-diversity.

Consider the global distribution G based on a certain QI
attribute set A which contains two possible signatures $s_1$ and
$s_2$ as shown in Table 5(a).

Suppose $t_1, t_2, t_3$ and $t_4$ match signatures $s_1, s_1, s_2$
and $s_2$, respectively. There are six possible worlds w as shown
in Table 5(b). For example, the first row is the possible world
$w_1$ with mapping $\{t_1 : x, t_2 : x, t_3 : \bar{x}, t_4 : \bar{x}\}$.
The table also shows the probability $p(w)$ of the possible
worlds. Take the first possible world $w_1$ for illustration.
From the global distribution in Table 5(a), $p(s_1 : x) = 0.5$ and
$p(s_2 : \bar{x}) = 0.8$. Hence,
$p(w_1) = 0.5 \times 0.5 \times 0.8 \times 0.8 = 0.16$. The sum of
probabilities $p(w)$ of all possible worlds from Table 5(b) is
equal to 0.16 + 0.04 + 0.04 + 0.04 + 0.04 + 0.01 = 0.33.
Consider $w_1$ again. Since $p(w_1) = 0.16$,
$p(w_1 \mid L_k) = 0.16/0.33 = 0.48$.

Suppose the adversary is interested in the probability that
$t_1$ is linked to x. We obtain $p(t_1 : x)$ as follows. $w_1$, $w_2$
and $w_3$, as shown in Table 5(b), contain "$t_1 : x$". Thus,
$p(t_1 : x)$ is equal to the sum of the probabilities
$p(w_1 \mid L_k)$, $p(w_2 \mid L_k)$ and $p(w_3 \mid L_k)$:
$p(t_1 : x) = 0.48 + 0.12 + 0.12 = 0.72$, which is greater than
0.5, the intended upper bound under 2-diversity on the
probability that an individual is linked to a sensitive value. □
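Example 1 can be reproduced by enumerating the possible worlds directly. The sketch below is one illustrative way to evaluate Equations (1) to (3), not the authors' implementation: the function name and the signature labels "s1" and "s2" are hypothetical. It takes $f_1 = 0.5$ and $f_2 = 0.2$, so that the complement for $s_2$ is 0.8 as in the products of Table 5(b).

```python
from itertools import combinations
from math import prod


def linkage_probabilities(signatures, n_x, f):
    """Return p(t_j : x) for each tuple of one A-group.

    signatures: signature of each tuple in the group
    n_x:        number of occurrences of x among the group's sensitive values
    f:          global distribution, f[s] = p(s : x)
    """
    size = len(signatures)
    total = 0.0                # sum of p(w) over all possible worlds
    linked = [0.0] * size      # sum of p(w) over worlds linking t_j to x
    for x_tuples in combinations(range(size), n_x):
        chosen = set(x_tuples)
        # Equation (1): product of the projected-event probabilities
        weight = prod(
            f[s] if j in chosen else 1.0 - f[s]
            for j, s in enumerate(signatures)
        )
        total += weight
        for j in chosen:
            linked[j] += weight
    # Equations (2) and (3): normalize within the group
    return [value / total for value in linked]


f = {"s1": 0.5, "s2": 0.2}
probs = linkage_probabilities(["s1", "s1", "s2", "s2"], n_x=2, f=f)
breached = [p > 1 / 2 for p in probs]   # threshold intended by 2-diversity
print([round(p, 4) for p in probs], breached)
```

Without intermediate rounding, $p(t_1 : x)$ comes out at about 0.727; the 0.72 in the example results from summing the rounded per-world values.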

Let $|L_k|$ be the size of the A-group containing $t_j$ and
$|\mathcal{W}_k|$ be the number of possible worlds in an A-group
$L_k$. We generate $|\mathcal{W}_k|$ possible worlds. For each
possible world, we calculate $p(w)$ and $p(w \mid L_k)$ in
$O(|L_k|)$ time. Thus, the time complexity is
$O(|L_k| \cdot |\mathcal{W}_k|)$.

The time complexity depends on two factors. One is
$|\mathcal{W}_k|$ and the other is $|L_k|$. (1) $|\mathcal{W}_k|$ is
equal to $\binom{N}{n}$, where n is the number of tuples with x in
this A-group of size N and $\binom{N}{n}$ denotes the total number
of possible ways of choosing n objects from N objects. Note that
$|\mathcal{W}_k|$ is typically small because n is usually a small
number. For l-diversity, algorithm Anatomy [29] restricts each
A-group to contain either l or l + 1 tuples, and each sensitive
value x appears at most once. Here, n is equal to 1. Thus, for
each possible x, $|\mathcal{W}_k|$ is at most l + 1. For algorithm
MASK [27], in our experiment with l = 2, the greatest frequency
of x in an A-group is 8. The size of this A-group is 23.
$|\mathcal{W}_k|$ is equal to $\binom{23}{8} = 490{,}314$. When
l = 10, the greatest possible value of $|\mathcal{W}_k|$ is
140,364,532. These values are small compared with the excessive
number of possible worlds studied in uncertain data
[14, 8, 9, 4, 11] (e.g., $10^{10^6}$ in [4]). In the experimental
setups in existing works [21, 29, 17, 27, 18], $l \le 10$. In
other words, $\mathcal{W}_k$ can be generated within a reasonable
time. (2) $|L_k|$ is bounded by the greatest size of the A-group,
which depends on the anonymization technique. For example,
$|L_k|$ is equal to l or l + 1 for algorithm Anatomy [29], which
restricts each A-group to contain either l or l + 1 tuples. In
our experiment, $|L_k|$ is at most 23 for algorithm MASK [27]
where l = 2.
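The counts quoted above are binomial coefficients and can be checked directly. The helper name below is hypothetical; the inputs are the Anatomy worst case (group size l + 1, one occurrence of x) and the MASK case reported in the text (group size 23, eight occurrences of x).

```python
from math import comb


def num_possible_worlds(group_size, n_x):
    """Number of ways to choose which n_x of the tuples carry x."""
    return comb(group_size, n_x)


l = 2
anatomy_worst = num_possible_worlds(l + 1, 1)   # Anatomy: x appears at most once
mask_reported = num_possible_worlds(23, 8)      # MASK case quoted in the text
print(anatomy_worst, mask_reported)
```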

4. MINING FOREGROUND KNOWLEDGE 

We first describe how we find the global distribution G 
of a certain attribute set A from the anonymized data in 
Section 4.1. Next, we introduce a pruning strategy to prune 
our search space of attribute sets in Section 4.2. Finally, 
we describe the algorithm for finding the global distribu- 
tion of multiple attribute sets and discuss its complexity in 
Section 4.3. 

4.1 Foreground Knowledge 

In the previous section, we assumed that the values of $f_i$ are
given. Here we consider how to derive $f_i$ from the published
table T*. We will develop m equations involving the m
variables $f_i$, $1 \le i \le m$.

Let the set of A-groups in T* be $L_1, \ldots, L_u$. Let
$L_k(s_i)$ be the set of tuples in $L_k$ matching signature $s_i$.
For example, in Table 2, let $s_1$ = {"American"}. Then,
$L_1(s_1)$ contains only the first tuple.

Let $\mathcal{L}_{s_i}$ be the set of A-groups containing tuples
which match $s_i$. That is,
$\mathcal{L}_{s_i} = \{L_k \mid L_k(s_i) \neq \emptyset\}$.
Table 7: A raw table

Table 8: An example illustrating the computation of the global
distribution, with panels (a) QI Table and (b) Sensitive Table

$f_i$ is equal to the expected number of tuples which match
$s_i$ and are linked to x in T*, divided by the number of tuples
which match $s_i$ in T*. Let $c_k(s_i : x)$ be the expected number
of tuples which match $s_i$ and are linked to x in the A-group
$L_k$. Then, we can express $f_i$ as follows.

$$f_i = \frac{\sum_{L_k \in \mathcal{L}_{s_i}} c_k(s_i : x)}{\sum_{L_k \in \mathcal{L}_{s_i}} |L_k(s_i)|} \tag{4}$$

The denominator is simply equal to the number of occurrences
of $s_i$ in T*, which can be easily found from the
dataset. Let us consider the term $c_k(s_i : x)$ in the
numerator.

Without additional knowledge to suggest otherwise, we assume
that the event that a tuple matching $s_i$ in $L_k$ is linked
to x is independent of the event that another tuple also
matching $s_i$ in $L_k$ is linked to x. Then we have the
following.

$$c_k(s_i : x) = |L_k(s_i)| \times p(t_j : x) \tag{5}$$

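where $t_j$ is any tuple in $L_k$ matching $s_i$. Note that any
$t_j$ in $L_k$ matching $s_i$ can be used here since all such
$p(t_j : x)$ values are equal. Substituting Equations (3) and (2)
into the above equation, we get

$$c_k(s_i : x) = |L_k(s_i)| \times \frac{\sum_{w \in \mathcal{W}_k,\; (t_j : x) \in w} p(w)}{\sum_{w' \in \mathcal{W}_k} p(w')} \tag{6}$$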

Hence, $c_k(s_i : x)$ is expressed in terms of probabilities
$p(w)$, which in turn are expressed in the m variables $f_i$ (see
Equation (1), where $p_{j,w}$ is equal to $f_i$ or $\bar{f}_i$).
Here note that $\bar{f}_i = 1 - f_i$.

There are m equations of the form of Equation (4), one for
each $f_i$, $1 \le i \le m$. These equations involve
m variables, $f_i$. This is a classical problem of a system of
simultaneous non-linear equations, which occurs in many
applications. It can be solved by conventional methods such as
Newton's method and Bairstow's iteration. Since Newton's
method [10] is known to be effective and feasible, we
choose this method for our study in this paper.

Example 2. Consider a table T containing six tuples,
$t_1, t_2, \ldots, t_6$, as shown in Table 7. If the objective of
the privacy requirement is 2-diversity, T does not satisfy
2-diversity. Thus, an anonymized dataset T* (Table 8) with
three A-groups, $L_1$, $L_2$ and $L_3$, is published (for each
sensitive value x and each A-group, the fraction of tuples with
x is at most 0.5). Note that $L_3$ satisfies 2-diversity because
$\bar{x}$ stands for any value not equal to x: in $L_3$, the first
$\bar{x}$ corresponds to a value y and the second $\bar{x}$
corresponds to another value z.

Consider the global distribution of attribute set A. There 
are two possible signatures based on A, namely si and 82. 



Thus, we have two equations with two variables, namely fi 
and f2, the probabilities in the global distribution G of A as 
shown in Table 4- 

Consider f\ . Since only A-groups Li and L2 contain the 
tuples matching si, Cs^ = {Li,L2}. 

h = lT,L^eCs,Ckisi : x)]/[J2L^eCs, \Lk{si)\] 

Li contains one tuple ti matching si and L2 contains two 
tuples t3,t4 matching si, |Li(si)| = 1 and |L2(si)| = 2. 
Thus, 

/i = [1 X p(ti : a;) 2 X p(t3 : x)]/{l + 2) (7) 

Consider Li . There are only two possible worlds, wi = {ti : 
x,t2 : x} and W2 ~ {tj : x,t2 : x}. Note that ti and t2 
match signatures s I ands2, respectively. pi,wi = fi,P2,wi = 
f2'Pi,'W2 = fj^a-nd P2,W2 = /2- Thus, p{w{)_= Pi,nii x 
P2,wi = fi X h and p{w2) = pi,w2 x P2,W2 = /i x /2- We 
derive that 

p{ti : x)=p{wi\Li) = fih/ifih + hf2) 

Similarly, consider L2 . There are two possible worlds, W3 = 
{t3 : x,t4 : x} and UI4 = {ts :x,t4 : x}. Similarly, p{w3) = 
fi X /i and p{w4) = fi x fi- We have 

p{t3 : x)=p{w4\L2) = fih/ifih+hfi) = 1/2 

Prom (7), we obtain 

fi = [fih/{fih+Tif2) + i]/3 

= - /2)/(/l(l - /2) + (1 - /l)/2) + l]/3 

Similarly, since Li contains one tuple t2 matching S2 and 
L3 contains two tuples tsjte matching 82, 

f2 = [lxp{t2:x) + 2xp{t5:x)]/{l-\-2) 

= [7r/2/(/i^+7r/2)+o]/3 

= [(1 - /l)/2/(/l(l - /2) + (1 - /l)/2)]/3 

With the above two equations involving two variables, we 
adopt Newton's method to solve for these variables. 

Finally, we obtain $f_1 = 0.666667$ and $f_2 = 0.000000$.
Thus, we derive $\bar{f}_1 = 0.333333$ and $\bar{f}_2 = 1.000000$. □
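The example can be checked numerically. The following is a minimal sketch, not the authors' implementation: it replaces Newton's method with a plain fixed-point iteration that repeatedly re-estimates each global probability from the current possible-world probabilities. The function names `posterior_x` and `estimate_global`, the starting value 0.5 and the tolerance are illustrative assumptions.

```python
from itertools import combinations


def posterior_x(group_sigs, n_x, f):
    """Return p(t_j : x) for every tuple of one A-group.

    group_sigs[j] is the signature index of tuple j, n_x is the number of
    tuples in the group linked to x, and f[s] is the current estimate of the
    global probability for signature s. Possible worlds are enumerated by
    choosing which n_x tuples hold x.
    """
    n = len(group_sigs)
    total = 0.0
    post = [0.0] * n
    for holders in combinations(range(n), n_x):
        weight = 1.0
        for j, s in enumerate(group_sigs):
            weight *= f[s] if j in holders else 1.0 - f[s]
        total += weight
        for j in holders:
            post[j] += weight
    if total == 0.0:
        # Every world has zero weight; fall back to a uniform assumption.
        return [n_x / n] * n
    return [p / total for p in post]


def estimate_global(groups, n_sigs, start=0.5, tol=1e-12, max_iter=1000):
    """Fixed-point iteration (an assumed stand-in for Newton's method)."""
    f = [start] * n_sigs
    for _ in range(max_iter):
        sums = [0.0] * n_sigs
        counts = [0] * n_sigs
        for sigs, n_x in groups:
            for s, p in zip(sigs, posterior_x(sigs, n_x, f)):
                sums[s] += p
                counts[s] += 1
        new_f = [sums[s] / counts[s] for s in range(n_sigs)]
        if max(abs(a - b) for a, b in zip(f, new_f)) < tol:
            return new_f
        f = new_f
    return f


# L1 = {t1, t2}, L2 = {t3, t4}, L3 = {t5, t6}; signature 0 is s1, 1 is s2.
# Each entry is (signatures of the tuples, number of tuples linked to x).
groups = [([0, 1], 1), ([0, 0], 1), ([1, 1], 0)]
f1, f2 = estimate_global(groups, n_sigs=2)
print(round(f1, 6), round(f2, 6))
```

Under these assumptions the iteration settles near 2/3 for the first signature and near 0 for the second, in agreement with the values reported in the example.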

4.2 Pruning Attribute Sets 

The adversary may choose to attack with as many attribute
sets as possible. Although there are many attribute sets in the
anonymized data, the global distribution of an attribute set is
not always reliable: if it is derived from a small sample, i.e., a
small set of tuples matching the same signature, the distribution
is not accurate. For example, consider attribute set
A = "Nationality" and the signature {"American"}. Suppose
there are only a few Americans, say 10, in the published table
T*. Intuitively, 10 Americans cannot represent a meaningful
global distribution. We will make use of the sample size studied
in the statistics literature to determine whether a distribution
is reliable. The adversary launches an attack only on the basis
of reliable distributions.

Based on studies in statistics [25], we use the following
theorem to determine the acceptable sample size (i.e., the
size of the set which contains the tuples matching the same
signature s). Let S be a random sample of tuples for a
signature s, and $p$ be the expected fraction of tuples in S
with the sensitive value $x$. Let $\hat{p}$ be the observed fraction of
tuples with the sensitive value $x$ in the sample S. Then the
following theorem applies.

Theorem 1 (Sample Size [25]). Given an error parameter
$\epsilon > 0$ and a confidence parameter $\alpha > 0$, if random
sample S has size $|S| \ge \frac{1}{2\epsilon^2} \ln \frac{2}{\alpha}$, the probability that
$|\hat{p} - p| > \epsilon$ is at most $\alpha$. □
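A minimal sketch of how such a threshold might be applied is given below. The bound in the source is partly illegible, so the code assumes the standard Hoeffding-style form, sample size at least ln(2/alpha) / (2 * eps^2); the function names `min_sample_size` and `is_reliable` are hypothetical.

```python
import math


def min_sample_size(eps, alpha):
    """Smallest integer n with n >= ln(2 / alpha) / (2 * eps**2).

    Assumption: this Hoeffding-style form stands in for the bound of
    Theorem 1; eps is the error parameter, alpha the confidence parameter.
    """
    if eps <= 0 or not 0 < alpha <= 1:
        raise ValueError("need eps > 0 and 0 < alpha <= 1")
    return math.ceil(math.log(2.0 / alpha) / (2.0 * eps ** 2))


def is_reliable(n_matching, eps, alpha):
    """True if a signature matched by n_matching tuples has enough samples."""
    return n_matching >= min_sample_size(eps, alpha)


# A signature matched by only 10 tuples is rejected for eps = 0.01.
print(min_sample_size(0.01, 0.9), is_reliable(10, 0.01, 0.9))
```

Signatures that fail this check would fall back to the uniform assumption rather than contribute a global distribution.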

If the sample size is not large enough to satisfy the error
bound, a uniform distribution is assumed. The sample size
satisfies a monotonicity property. Formally, without loss of
generality, assume that there are $u$ attributes, namely
$A_1, \ldots, A_u$. Let $v_1 \in A_1, \ldots, v_u \in A_u$. Let
$y(v_1, \ldots, v_i)$ be the number of tuples with attributes
$(A_1, \ldots, A_i)$ equal to $(v_1, \ldots, v_i)$. Given a positive
integer $J$, if $y(v_1, \ldots, v_i) < J$, then
$y(v_1, \ldots, v_i, v_{i+1}) < J$. With this monotonicity property,
whenever we find that $y(v_1, \ldots, v_i)$ is not large enough, we
do not need to count the tuples with values $v_1, \ldots, v_{i+1}$,
because $y(v_1, \ldots, v_i, v_{i+1})$ is not large enough either.
This helps to prune the search space.
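The pruning idea can be sketched as a depth-first enumeration that stops extending a signature as soon as too few tuples match it. This is an illustrative sketch rather than the paper's procedure; `frequent_signatures`, the tuple-of-values row layout and the toy data are assumptions.

```python
from collections import defaultdict


def frequent_signatures(rows, min_size):
    """Return {signature: count} for signatures matched by >= min_size rows.

    A signature is a tuple of (attribute_index, value) pairs with increasing
    attribute indices. By monotonicity, a signature with too few matching
    rows is never extended with further attributes.
    """
    n_attrs = len(rows[0]) if rows else 0
    found = {}

    def extend(sig, start, matching):
        for a in range(start, n_attrs):
            buckets = defaultdict(list)
            for row in matching:
                buckets[row[a]].append(row)
            for value, bucket in buckets.items():
                if len(bucket) < min_size:
                    continue  # pruned: no extension can reach min_size
                new_sig = sig + ((a, value),)
                found[new_sig] = len(bucket)
                extend(new_sig, a + 1, bucket)

    extend((), 0, rows)
    return found


# Toy table with attributes (Nationality, Sex); values are made up.
rows = [("US", "M")] * 3 + [("US", "F")] + [("JP", "M")] * 2
sigs = frequent_signatures(rows, min_size=2)
for sig, count in sorted(sigs.items()):
    print(sig, count)
```

In practice `min_size` would be the sample-size threshold, so only signatures with enough matching tuples are kept.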

4.3 Algorithm 

In this section, we describe how to compute the set $\mathcal{G}$
of all global distributions of multiple attribute sets using the
sample size just described. The steps are shown
in Algorithm 1.



Algorithm 1 Computation of the global distributions

• Step 1: For each attribute set A, we first identify the
set $S_A$ of signatures $s_i$ with respect to A, where each $s_i$
is matched by some tuples in T* and has sufficient sample
size. For example, for A = {"Nationality", "Sex"},
a signature equal to {"American", "Male"} is matched
by the first tuple in Table 2(a). If it has sufficient
sample size, it is stored in $S_A$.

• Step 2: For each attribute set A, if $S_A$ is non-empty,
we calculate the global distribution of A according to
$S_A$ for each sensitive value $x$.



In the algorithm, Step 1 finds all signatures with sufficient
sample size for each attribute set A. Similar to frequent
pattern mining, this step typically completes within
a reasonable time. Let $\alpha$ be the time for this step. After
we have determined the sample sizes, $\mathcal{G}$ is used to store the
global distributions of all attribute sets that contain
signatures with sufficient sample size.

Step 2 calculates the global distribution of A according
to non-empty $S_A$ for each attribute set A. In other words,
it finds each global distribution in $\mathcal{G}$. As described in
Section 4.1, for a particular global distribution, we formulate
$m$ equations with $m$ variables, where $m$ is the total number
of signatures for A. The average number of terms in each
equation is $O(N \cdot |\mathcal{W}_k| \cdot |\mathcal{L}_{s_i}|)$, where $N$ is the average
A-group size, $|\mathcal{W}_k|$ is the average number of possible worlds
in an A-group $L_k$ and $|\mathcal{L}_{s_i}|$ is the average number of
A-groups with tuples matching a signature $s_i$. If Newton's
method takes $\beta$ time to find a solution, the computation
for a global distribution takes $O(m \cdot N \cdot |\mathcal{W}_k| \cdot |\mathcal{L}_{s_i}| + \beta)$
time. Since there are $|\mathcal{G}|$ global distributions, Step 2 takes
$O(|\mathcal{G}| \cdot (m \cdot N \cdot |\mathcal{W}_k| \cdot |\mathcal{L}_{s_i}| + \beta))$ time.



Thus, the total running time is
$O(\alpha + |\mathcal{G}| \cdot (m \cdot N \cdot |\mathcal{W}_k| \cdot |\mathcal{L}_{s_i}| + \beta))$.
Note that the values of $m$, $N$, $|\mathcal{W}_k|$ and $|\mathcal{L}_{s_i}|$ are
small and the complexity is dominated by $|\mathcal{G}|$ and $\beta$. However,
as the attribute set size increases, the sample size quickly
becomes insufficient, and so $|\mathcal{G}|$ typically stays moderate.

In all of our experiments, the system of $m$ equations in
Step 2 can be solved in a relatively short time, so $\beta$ is
also reasonable. For the benchmark dataset, adult, foreground
knowledge can be mined within 12 minutes in all our
experiments.

The probabilistic analysis is similar in nature to that studied
for uncertain databases [8, 9, 4]. The computational
complexity above is in fact much smaller than in these previous
works; in [4], all results are returned within 3 hours. The
reason is that [8, 9, 4] analyze the possible worlds based on
the entire uncertain table (which can be regarded as a single
large A-group), while we analyze the possible worlds based
on a single small A-group (which is typically smaller than
the entire table).

4.4 Discussion 

We have just discussed how to find the global distribution
from the published table. One may argue that the global
distribution $\mathcal{G}$ found from the published table is just an
approximation of the true global distribution $\mathcal{G}_o$ found from
the original table, and thus that the privacy breaches found in
Section 3 according to $\mathcal{G}$ are invalid. We disagree with
this argument for the following reasons.

Firstly, since the adversary does not have the true global
distribution $\mathcal{G}_o$ (because s/he has not seen the original
table), the adversary's best knowledge about the global
distribution is $\mathcal{G}$.

Secondly, an adversary with $\mathcal{G}$ is more powerful and more
sophisticated than an adversary without any knowledge
about the global distribution. The former adversary is the one
we study in this paper and can cause the individual privacy
breaches discussed in Section 3, while the latter is
the normal adversary studied in the privacy literature [21,
29, 27] and cannot cause any of the individual privacy
breaches found by the former.

Thirdly, an adversary $A_o$ with $\mathcal{G}_o$ (if such an adversary
exists) does not perform more serious privacy attacks than an
adversary $A$ with $\mathcal{G}$. We assume that adversary $A_o$ can have
the true global distribution $\mathcal{G}_o$. This means that the public
can also know $\mathcal{G}_o$, i.e., $\mathcal{G}_o$ is not secret information.*

Consider adversary $A$. Before s/he obtains $\mathcal{G}$, individual
privacy (in the published table) is protected. After s/he
obtains $\mathcal{G}$ (which can be found from the published table),
individual privacy is breached, because his/her belief changes
after seeing $\mathcal{G}$. There are two kinds of privacy breaches.
In the first, the adversary correctly guesses the true
sensitive value of an individual. In the second, s/he guesses
the sensitive value of an individual incorrectly. For example,
even if an individual is not linked to HIV in the original
table, s/he can guess that the probability that this individual
is linked to HIV is very high. This is also considered a
privacy breach for this individual. The reason is that the
disclosure of the high linkage between this individual and HIV
hurts the reputation of the individual, because the adversary
can convince a certain set of people that the inference
procedure behind the breach was reasonable. Thus, privacy
breaches found by $A$ are also serious.

* If this is not true, one of the ways that adversary $A_o$ can
obtain $\mathcal{G}_o$ is to steal the original table from the data publisher.
Since s/he has the original table, the privacy breaches found
by $A_o$ are more serious. In this paper, we do not study the
case where the adversary can steal the original table.

Consider adversary $A_o$. In this case, we know that $\mathcal{G}_o$ is
public information. Thus, the data publisher must have
already taken $\mathcal{G}_o$ into account when publishing the table. This
claim holds because, otherwise, no individuals would be willing
to disclose their information to the data publisher. Thus, even
if adversary $A_o$ sees $\mathcal{G}_o$, s/he cannot breach any individual
privacy in the published table.

5. EMPIRICAL STUDY 

A Pentium IV 2.2GHz PC with 1GB RAM was used to
conduct our experiments. The algorithm was implemented in
C/C++. We adopted the publicly available Adult Database
from the UC Irvine Machine Learning Repository
[6]. This dataset (5.5MB) was also adopted by [16, 21, 26,
12, 27]. We used a configuration similar to [16, 21, 27]. The
records with unknown values were first eliminated, resulting
in a dataset with 45,222 tuples (5.4MB). Nine attributes
were chosen in our experiment, namely Age, Work Class,
Marital Status, Occupation, Race, Sex, Native Country,
Salary Class and Education. By default, we chose the first
five attributes and the last attribute as the quasi-identifier
and the sensitive attribute, respectively. Similar to [27], in
attribute "Education", all values representing the education
levels before "secondary" (or "9th-10th"), such as "1st-4th",
"5th-6th" and "7th-8th", are regarded as a sensitive value set,
and an adversary checks whether each individual is linked
to this set with probability more than 1/r, where r is a
parameter.

There are 3.46% tuples with education levels before
"secondary". We set $\epsilon = 0.01$ and $\alpha = 0.9$ for sampling. That is,
the allowed relative error of sampling is 1/3.46 = 28.90%,
which is considered large. A larger allowed error means fewer
attribute sets can be pruned. Since there is a set $\mathcal{G}$ of
multiple global distributions G, we can calculate p(t : x) for
different G's and different x's. We report the greatest such
value as the probability that individual t is linked
to some sensitive value, since this corresponds to the
worst-case privacy breach.
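The arithmetic behind the quoted relative error can be spelled out as follows, assuming (as an interpretation of the text) that the relative error is the absolute error parameter divided by the overall fraction of sensitive tuples, and that the reported value is the maximum over all derived global distributions G and sensitive values x:

```latex
\frac{\epsilon}{p} = \frac{0.01}{0.0346} \approx 0.2890 = 28.90\%,
\qquad
p_{\text{report}}(t) = \max_{G} \, \max_{x} \; p(t : x)
```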

5.1 Privacy Breach in l-diverse Tables

In this section, we show that the foreground knowledge
attack is successful on the published data generated from
the benchmark dataset, adult, by a well-known privacy
algorithm, Anatomy [29]. We set l = r, where l is the parameter
of l-diversity used in Anatomy. We implemented the formula
in Section 3 to calculate the probability of a privacy breach
and the formula in Section 4 to find the global distribution
from the published data. If a tuple which appears in the
published data is identified as a privacy breach by our
algorithm, it is said to be a problematic tuple. The tuples linked
to sensitive values in the original table are called sensitive
tuples. In this case study, we evaluate privacy breaches with
three measurements:

1. the proportion of problematic tuples among sensitive
tuples (this is the recall in IR research),

2. the proportion of non-sensitive tuples which are wrongly
identified as problematic tuples by our algorithm,

3. the average probability by which individual privacy is
breached among all sensitive tuples.
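The three measurements can be computed along the following lines. This is a hedged sketch: the function name `breach_metrics`, the input layout and the rule that a tuple is problematic when its breach probability exceeds 1/r are assumptions made for illustration, and the sample numbers are invented.

```python
def breach_metrics(records, r):
    """Compute the three measurements for one published table.

    records: list of (is_sensitive, breach_prob) pairs, one per tuple, where
    breach_prob is the adversary's derived probability for that tuple.
    Assumption: a tuple is "problematic" when breach_prob > 1 / r.
    """
    threshold = 1.0 / r
    sensitive = [p for is_sens, p in records if is_sens]
    non_sensitive = [p for is_sens, p in records if not is_sens]
    if not sensitive or not non_sensitive:
        raise ValueError("need at least one tuple of each kind")
    # 1. proportion of problematic tuples among sensitive tuples (recall)
    recall = sum(p > threshold for p in sensitive) / len(sensitive)
    # 2. proportion of non-sensitive tuples wrongly flagged as problematic
    false_rate = sum(p > threshold for p in non_sensitive) / len(non_sensitive)
    # 3. average breach probability among sensitive tuples
    avg_prob = sum(sensitive) / len(sensitive)
    return recall, false_rate, avg_prob


# Invented example: two sensitive and four non-sensitive tuples, r = 2.
records = [(True, 0.9), (True, 0.4),
           (False, 0.6), (False, 0.1), (False, 0.2), (False, 0.3)]
recall, false_rate, avg_prob = breach_metrics(records, r=2)
print(recall, false_rate, avg_prob)
```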

We have conducted experiments varying r and the QI size.
(1) Variation of r:
When r = 2 with default settings, the average probability
of an individual privacy breach among all sensitive tuples
is 0.8917 (> 1/2). When r is increased to 4, it becomes
0.4640 (> 1/4). When r increases, there is a higher chance
that a tuple forms an A-group with other tuples, so the
average size of A-groups is larger and the average probability
of privacy breaches decreases. We also studied the
proportion of problematic tuples among all sensitive tuples
and the proportion of non-sensitive tuples wrongly identified
as privacy breaches.

We found that, in most cases, more than 99% of sensitive
tuples have privacy breaches and less than 6% of
non-sensitive tuples are wrongly identified. (2) Variation of the
QI size: When the QI size is equal to 3 with default settings
where r = 2, the average probability of a privacy
breach is 0.80307. When the size is increased to 8, it becomes
0.943526. This is because, when there are more QI
attributes, it is more likely that a QI attribute (or attribute
set) gives a global distribution which can lead to privacy
breaches.

We also have a case study on the published data generated
by Anatomy. Suppose the QI attributes chosen are Age,
Marital Status and Occupation and the sensitive attribute
is Education. In the original data, there are the following 2
tuples.

Age   Marital Status        Occupation     Education
39    Never-married         Adm-clerical   Bachelors
20    Married-civ-spouse    Craft-repair   5th-6th

Suppose the objective of Anatomy is 2-diversity. Since
"5th-6th" is a sensitive value, Anatomy forms an A-group
containing these two tuples. However, from the global
distribution derived from the published data with respect to
attribute Occupation, the probability that an individual
with Occupation="Adm-clerical" is linked to a low
education is only 0.02, but the probability that an individual with
Occupation="Craft-repair" is linked to a low education is
0.04. Since there is a significant difference in the global
distribution of attribute Occupation, the probability that the
second tuple above is linked to a low education is 0.67 (which
is greater than 0.5).
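The value 0.67 can be reproduced with the two-tuple possible-world formula of Section 4.1, assuming the A-group holds exactly these two tuples and exactly one low-education value. Here $f_1 = 0.02$ is the global probability for "Adm-clerical", $f_2 = 0.04$ the one for "Craft-repair", and $x$ denotes a low education; this labelling is chosen here for illustration.

```latex
p(t_2 : x)
  = \frac{(1 - f_1)\, f_2}{f_1 (1 - f_2) + (1 - f_1)\, f_2}
  = \frac{0.98 \times 0.04}{0.02 \times 0.96 + 0.98 \times 0.04}
  = \frac{0.0392}{0.0584}
  \approx 0.67
```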

It is noted that the global distribution derived from
the published data matches the real situation that
"Adm-clerical" jobs require higher education but "Craft-repair"
jobs do not. In other words, the foreground knowledge
can help the adversary to breach individual privacy.
More specifically, let us check whether the real global
distribution derived from the original table is similar to the
global distribution derived from the published data. From
the original table, the probability that an individual with
Occupation="Adm-clerical" is linked to a low education
is only 0.01, but the probability that an individual with
Occupation="Craft-repair" is linked to a low education is
0.04. We observe that this global distribution is similar to
that derived from the published data.

With our default experimental setting using sufficient
sample size, for 2-diversity, the average relative error of the
global probabilities derived from the published data is 0.7%,
i.e., an accuracy of 99.3%. For 10-diversity, the error
increases to 5.26%, i.e., an accuracy of 94.74%. This shows
that, statistically, the accuracy is very high. In other words,
the foreground knowledge derived from the published data
is quite accurate compared with the knowledge derived from
the original table.

In all our experiments, privacy breaches can be found
within 12 minutes, which shows that the foreground knowledge
attack can easily be realized.

5.2 Privacy Breach in Other Privacy Models 

We studied privacy breaches with four algorithms:
Anatomy [29], MASK [27], Injector [18] and t-closeness [17].
They were selected because they consider l-diversity or
similar privacy requirements, so we need only set l = r. For
Anatomy, we set l = r. For MASK, the parameters k and
m used in MASK are set to r. For Injector, the parameters
minConf, minExp and l are set to 1, 0.9 and r,
respectively, which are the default settings in [18]. For t-closeness,
similar to [17], we set t = 0.2. We evaluate the algorithms
in terms of five measurements: (1) the time for mining
foreground knowledge, (2) the execution time, (3) the proportion of
problematic tuples among all sensitive tuples, (4) the
average of the greatest difference in the global probabilities in
each A-group (in our figures, we label this as the average value
of $\Delta$), and (5) the relative error ratio in answering an
aggregate query, as in [29, 27, 18], using the published data. For
each measurement, we conducted the experiments 100 times
and took the average.

We do not report the time for finding privacy breaches
because the time is very short (within a few minutes). For
the sake of space, since the proportion of non-sensitive tuples
wrongly identified as privacy breaches is small (less than
10%), we do not report it here.

Let us explain measurements (4) and (5). (4) Consider
an A-group $L_k$ containing two tuples matching signatures $s_i$
and $s_j$, respectively. Suppose $p(s_i : x)$ is the greatest global
probability and $p(s_j : x)$ is the smallest in the A-group.
The value of $\Delta$ in $L_k$ is equal to $p(s_i : x) - p(s_j : x)$. The
average value of $\Delta$ is taken over all A-groups and all
attribute sets A with sufficient samples. (5) The relative error
ratio measures the utility of the published data. We adopt
all query parameters in [29, 27, 18]. For each evaluation, we
performed 10,000 queries and reported the average relative
error ratio.
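Both measurements reduce to simple arithmetic, sketched below. The helper names `average_delta` and `relative_error_ratio` are hypothetical, and the relative error ratio is assumed to be the usual |actual - estimate| / actual for a count query; the text defers to [29, 27, 18] for the exact query setup.

```python
def average_delta(groups):
    """Average, over A-groups, of the largest minus the smallest global
    probability p(s : x) among the tuples of each group.

    groups: list of lists; each inner list holds the global probability of
    the signature matched by each tuple in one A-group.
    """
    deltas = [max(g) - min(g) for g in groups if g]
    if not deltas:
        raise ValueError("need at least one non-empty A-group")
    return sum(deltas) / len(deltas)


def relative_error_ratio(actual, estimate):
    """Assumed definition: |actual - estimate| / actual, with actual > 0."""
    if actual <= 0:
        raise ValueError("actual answer must be positive")
    return abs(actual - estimate) / actual


# Invented numbers: two A-groups and one aggregate query.
delta = average_delta([[0.02, 0.04], [0.1, 0.1, 0.4]])
error = relative_error_ratio(actual=200, estimate=150)
print(delta, error)
```

A small average difference means the tuples in each A-group look alike to the adversary, which is the situation in which the attack is weakest.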

We have conducted the experiments by varying two fac- 
tors: (1) the QI size, and (2) r. 

Figure 1 and Figure 2 show the results when r is set to
2 and 10, respectively. Figure 1(a) shows that the time
for mining foreground knowledge increases with the QI size
because the algorithm needs to process more attribute sets.
Figure 1(b) shows that the execution time increases with
the QI size because the algorithms have to process more QI
attributes.

Figure 1(c) shows that the proportion of problematic
tuples among sensitive tuples increases with the QI size. With a
larger QI size, there is a higher chance of individual privacy
breaches because more attributes can be used to
construct the global distributions. MASK has fewer privacy
breaches than Anatomy and Injector because, as a
side-effect, the minimization of QI values in each A-group
adopted in MASK makes the difference in the global
distribution among all tuples in each A-group smaller. Thus, the
number of individuals with privacy breaches is smaller. It is
noted that there is no violation in t-closeness. The reason
why t-closeness has no privacy breaches is the large
A-groups formed by global recoding with respect to value
r (= 2). The average size of the A-group in the table
satisfying t-closeness is at least 4000 and the utility of the table is
low. It is noted that parameter t is independent of
parameter r. We will show that t-closeness has privacy breaches
when r = 10.

Figure 1: Effect of QI size (r = 2)

Figure 2: Effect of QI size (r = 10)

Figure 1(d) shows that, when the QI size increases, the
average value of $\Delta$ with respect to every attribute set
increases. The average value of $\Delta$ is the largest
in Anatomy and Injector, and the third largest in MASK.
This is because Anatomy and Injector do not take the
global distribution directly into consideration for
merging, while MASK does so indirectly during the minimization of
QI values.

Figure 3(a) shows that the average relative error of
t-closeness is the largest, since it forms large A-groups by
global recoding, which introduces a lot of errors and thus
reduces the utility of the published data.

(a) r = 2 (b) r = 10

Figure 3: Effect of QI size on average relative error

We have also conducted experiments when r = 10, as
shown in Figure 2. The results are similar, but the
time for mining foreground knowledge is larger. Since r is
larger and thus 1/r is smaller, the average value of $\Delta$ is
smaller when r = 10. Also, when r = 10, there are privacy
breaches for t-closeness in Figure 2(c) because there is a
higher privacy requirement when r = 10 and thus the size
of the A-group is not large enough for protection.

6. RELATED WORK 

With respect to the attribute types considered for data
anonymization, there are two branches of study. The
first branch is anonymization according to the QI attributes.
A typical model is k-anonymity [3, 16]. The other branch
is the consideration of both quasi-identifier attributes and
sensitive attributes. Some examples are [21], [28], [17], [18]
and [7]. In this paper, we focus on this branch. We want to
check whether the probability that each individual is linked
to any sensitive value is at most a given threshold.

l-diversity [21] proposes a model where l is a positive
integer and each A-group contains l "well-represented" sensitive
values. For t-closeness [17], the distribution in each A-group
in T* with respect to the sensitive attribute is roughly equal
to the distribution of the entire table T*. Given a real
number $\alpha \in [0, 1]$ and a positive integer k, $(\alpha, k)$-anonymity [28]
maintains that, for each A-group L, the number of tuples
in L is at least k and the frequency (in fraction) of each
sensitive value in L is at most $\alpha$.

We emphasize that t-closeness is different from our work.
Firstly, t-closeness does not have any privacy guarantee on
the bound of breach probabilities. In l-diversity, Anatomy
and $(\alpha, k)$-anonymity, the major goal of privacy protection is
to bound the probability that an individual is linked to a
sensitive value by a given threshold. However, t-closeness
has just an input parameter t expressing the bound on the
closeness between the distribution in each A-group and the
distribution of the entire table, which does not give any
bound on breach probabilities. Secondly, t-closeness does not
consider the QI attribute values for the distribution.
Specifically, the distribution of an A-group (or the entire table)
considered in t-closeness is the global distribution involving
the probability that an individual (with any QI attribute
values) is linked to a sensitive value. However, the global
distribution studied here involves the probability that an
individual with particular QI attribute values such as Japanese
is linked to a sensitive value. Thirdly, enforcing t-closeness
gives a large distortion on the anonymized dataset. This
is because it is usually the case that a small A-group has
a distribution which is very different from the distribution
of the entire table. In order to satisfy t-closeness, a
lot of A-groups should be merged to form a very large
A-group, which makes the distortion large. Fourthly, there
are not many useful patterns found in the table satisfying
t-closeness. As in [29, 30, 27], one objective of publishing the table
is to analyze the correlation between some QI attributes and
the sensitive attribute. Since t-closeness requires that each
A-group has nearly the same distribution as the distribution
of the entire table, the desired goal cannot be achieved.

In the literature, different kinds of background knowledge
are considered [21, 15, 22, 27, 20, 13, 18, 2]. [15] proposes
that the statistics of some attributes such as age and zipcode
can also be available to the public. [22] considers another
kind of background knowledge in the form of implications. [27]
discovers that the minimality principle of the anonymization
algorithm can also be used as background knowledge. [20]
proposes to use the kernel estimation method to mine the
background knowledge from the original table. [13] describes
that there are many tables published from different sources
containing overlapping individuals.

[18] finds that association rules can be mined from the 
original table and thus can be used for privacy protection 
during anonymization. In [2], the problem of privacy at- 
tack by adversarial association rule mining is investigated. 
Hence, the association rules are the foreground knowledge. 
However, as pointed out in [23], association rules used in [18] 
and [2] can contradict the true statistical properties. Also 
the solution in [2] is to invalidate the rules, but this will 
violate the data mining objectives of data publication. 

A recent work [1] proposes to generate a table in the form
of an uncertain data model. However, this work considers
k-anonymity, which ignores any sensitive attribute.

7. CONCLUSION 

In this paper, we point out a fundamental privacy breach
problem which has been overlooked in the past. With the
consideration of the utility of the anonymized table, group
based anonymization suffers from privacy breaches. Our
experiments show that the existing well-known privacy models
Anatomy, MASK, Injector and t-closeness suffer from
serious privacy breaches in a benchmark dataset. For future
work, we plan to study how to anonymize the data to defend
against the foreground knowledge attack. In our experiments,
we observe that the chance of privacy breaches is lower if
each group contains tuples with "similar" global
probabilities. Thus, forming A-groups with "similar" tuples is one
possible strategy. Another direction for future work is to study
the effect of background knowledge that may be possessed by the
adversary.